Data Joins + Transformations

Monday, April 24

Today we will…

Lab 3: Familiar Words – Sketch it out!

“For each demographic group listed below, determine all words in this study that were the most and least familiar, on average.”

Data Layouts

Tidy Data

Tidy data…

  • is rectangular.
  • has variables as columns and observations as rows.
  • has different formats for different tasks.

R4DS

Consequences of Messy Data

Illustration by Allison Horst

  • Tidy: use the same tools in similar ways for different datasets.
  • Messy: create unique tools that are difficult to generalize.

Creating Tidy Data

We may need to transform our data to turn it into the version of tidy that is best for a task at hand.

Illustration by Allison Horst

Creating Tidy Data

We want to look at mean cereal nutrients based on shelf.

  • The data are in a wide format – a separate column for each nutrient.
  • Transforming the data will make plotting easier.
library(liver)
data(cereal)
head(cereal)
name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 25 3 1 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 25 3 1 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 25 3 1 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 25 1 1 0.75 29.50954

Creating Tidy Data

Code
cereal_wide <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
shelf calories protein fat sodium fiber carbo sugars potass vitamins
1 102.5000 2.650000 0.60 176.2500 1.6850000 15.80000 4.800000 75.50000 20.00000
2 109.5238 1.904762 1.00 145.7143 0.9047619 13.61905 9.619048 57.80952 23.80952
3 107.7778 2.861111 1.25 158.6111 3.1388889 14.50000 6.527778 129.83333 35.41667
Code
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

Code
cereal_long <- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
shelf Nutrient mean_amount
1 calories 102.5000000
1 carbo 15.8000000
1 fat 0.6000000
1 fiber 1.6850000
1 potass 75.5000000
1 protein 2.6500000
1 sodium 176.2500000
1 sugars 4.8000000
1 vitamins 20.0000000
2 calories 109.5238095
2 carbo 13.6190476
2 fat 1.0000000
2 fiber 0.9047619
2 potass 57.8095238
2 protein 1.9047619
2 sodium 145.7142857
2 sugars 9.6190476
2 vitamins 23.8095238
3 calories 107.7777778
3 carbo 14.5000000
3 fat 1.2500000
3 fiber 3.1388889
3 potass 129.8333333
3 protein 2.8611111
3 sodium 158.6111111
3 sugars 6.5277778
3 vitamins 35.4166667
Code
cereal_long |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

Pivoting Data

Tidyexplain animation by Kelsey Gonzalez

Manual Method

Consider daily rainfall observed in SLO in January 2023.

  • The data is in a human-friendly form (like a calendar).
  • Each week has a row, and each day has a column.

Data source

How would you manually convert this to long format?

Manual Method: Steps

  1. Create a new column: Day_of_Week.

  2. Create a new column: Rainfall (hold daily rainfall values).

  3. Now we have three columns (Week, Day_of_Week, and Rainfall) – start moving Sunday values over.

  4. Duplicate Week 1-5 and copy Monday values over.

  5. Duplicate Week 1-5 and copy Tuesday values over.

  6. Continue for the rest of the days of the week.

  7. You may want to arrange() by Week to get the rainfall values in chronological order.

Computational Approach

We can use pivot_longer() to turn a wide dataset into a long(er) dataset.

pivot_longer()

Take a wide dataset and turn it into a long dataset.

  • cols – specify the columns that should be pivoted.
    • Do not include the names of ID columns (columns that should not be pivoted).
  • names_to – the name of the new column containing the old column names.
  • values_to – the name of the new column containing the old column values.

pivot_longer()

library(readxl)
slo_rainfall <- read_xlsx("data/2023-rainfall-slo.xlsx")

slo_rainfall |> 
  mutate(across(Sunday:Saturday, as.numeric)) |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
Week Day_of_Week Daily_Rainfall
1 Sunday 0.00
1 Monday 0.12
1 Tuesday 0.00
1 Wednesday 1.58
1 Thursday 0.91
1 Friday 0.00
1 Saturday 0.05
2 Sunday 0.27
2 Monday 4.26
2 Tuesday 0.43
2 Wednesday 0.00
2 Thursday 0.00
2 Friday 0.16
2 Saturday 1.41
3 Sunday 0.34
3 Monday 0.33
3 Tuesday 0.00
3 Wednesday 0.00
3 Thursday 0.13
3 Friday 0.00
3 Saturday 0.00
4 Sunday 0.00
4 Monday 0.00
4 Tuesday 0.00
4 Wednesday 0.00
4 Thursday 0.00
4 Friday 0.00
4 Saturday NA
5 Sunday NA
5 Monday NA
5 Tuesday NA
5 Wednesday NA
5 Thursday NA
5 Friday NA
5 Saturday NA

pivot_wider()

Take a long dataset and turn it into a wide dataset.

  • id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
  • names_from – the name of the column containing the new column names.
  • values_from – the name of the column containing the new column values.

pivot_wider()

We calculate the mean amount of protein for cereals on each shelf and for each manufacturer.

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
manuf shelf mean_protein
A 2 4.000000
G 1 3.000000
G 2 1.285714
G 3 2.666667
K 1 2.750000
K 2 2.142857
K 3 2.916667
N 1 2.666667
N 2 2.500000
N 3 4.000000
P 1 1.500000
P 2 1.000000
P 3 3.000000
Q 1 5.000000
Q 2 2.000000
Q 3 2.500000
R 1 2.000000
R 3 3.000000

pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
manuf 1 2 3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Better names in pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
manuf Shelf_1 Shelf_2 Shelf_3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Data Joins

Relational Data

Multiple, interconnected tables of data are called relational.

  • It is the relations between data sets, not just the individual data sets themselves, that are important.

IMDb movie relational data

Data Joins

We can combine (join) data tables based on their relations.

Mutating joins

Add variables from a new dataframe to observations in an existing dataframe.

full_join(), left_join(), right_join(), inner_join()

Filtering Joins

Filter observations based on values in a new dataframe.

semi_join(), anti_join()

Keys

A key uniquely identifies an observation in a data set.

  • To combine (join) two datasets, the key needs to be present in both.
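To confirm that a column actually works as a key, you can count how many rows share each value; any value appearing more than once does not uniquely identify an observation. A minimal sketch, using a hypothetical three-row version of the directors table:

```r
# Check key uniqueness: a valid key has no value identifying more than one row.
# (Hypothetical three-row version of the directors table.)
library(dplyr)

directors <- tibble::tibble(
  id         = c(429, 9247, 11652),
  first_name = c("Andrew", "Zach", "James (I)")
)

directors |> 
  count(id) |> 
  filter(n > 1)
# Zero rows returned means `id` uniquely identifies each observation.
```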

inner_join()

Keeps observations whose keys are present in both datasets.

inner_join(): IMDb Example

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
movies_directors
director_id movie_id
429 300229
9247 124110
11652 10920
11652 333856
14927 192017
15092 109093
15092 237431

ID: 429, 2931, 11652, 14927, 15092       ID: 429, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors)
director_id genre prob movie_id
429 Adventure 0.750000 300229
429 Fantasy 0.750000 300229
11652 Sci-Fi 0.500000 10920
11652 Sci-Fi 0.500000 333856
11652 Action 0.500000 10920
11652 Action 0.500000 333856
14927 Animation 1.000000 192017
14927 Family 1.000000 192017
15092 Comedy 0.545455 109093
15092 Comedy 0.545455 237431
15092 Crime 0.545455 109093
15092 Crime 0.545455 237431

Retained IDs: 429, 11652, 14927, 15092 (2931 and 9247 each appear in only one table, so they are dropped)

inner_join(): IMDb Example

What if our key does not have the same name?

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
directors
id first_name last_name
429 Andrew Adamson
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
15092 Ethan Coen
inner_join(directors_genres, 
           directors, 
           by = c("director_id" = "id"))
id first_name last_name genre prob
429 Andrew Adamson Adventure 0.750000
429 Andrew Adamson Fantasy 0.750000
11652 James (I) Cameron Sci-Fi 0.500000
11652 James (I) Cameron Action 0.500000
14927 Ron Clements Animation 1.000000
14927 Ron Clements Family 1.000000
15092 Ethan Coen Comedy 0.545455
15092 Ethan Coen Crime 0.545455

Mutating Joins

  • left_join() – keep only (and all) observations in the left data set

  • right_join() – keep only (and all) observations in the right data set

  • full_join() – keep all observations in both data sets
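As a sketch of how the three joins differ, consider two toy tables (hypothetical data, not from the IMDb tables) that share only one key value:

```r
library(dplyr)

x <- tibble::tibble(key = c("a", "b"), val_x = 1:2)
y <- tibble::tibble(key = c("b", "c"), val_y = 3:4)

left_join(x, y, by = "key")    # keys a, b    -- all rows of x kept
right_join(x, y, by = "key")   # keys b, c    -- all rows of y kept
full_join(x, y, by = "key")    # keys a, b, c -- all rows of both kept
```

Unmatched rows are filled with NA in the columns coming from the other table.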

Mutating Joins

Which directors would remain for each of the following?

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)

directors_genres |> 
  distinct(director_id)
director_id
429
2931
11652
14927
15092
movies_directors |> 
  distinct(director_id)
director_id
429
9247
11652
14927
15092

Filtering Joins: semi_join()

Keeps observations when their keys are present in both datasets, but only keeps variables from the first dataset.



Filtering Joins: semi_join()

semi_join(directors_genres, movies_directors)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Movie Directors: 429, 2931, 11652, 14927, 15092

directors_genres |>
  filter(director_id %in% movies_directors$director_id)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Filtering Joins: anti_join()

Removes observations when their keys are present in both datasets, and only keeps variables from the first dataset.




Filtering Joins: anti_join()

anti_join(directors_genres, movies_directors)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

Movie Directors: 429, 2931, 11652, 14927, 15092

directors_genres |>
  filter(!director_id %in% movies_directors$director_id)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

Piping Joins

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!


inner_join(directors_genres, movies_directors)

…is equivalent to…

directors_genres |> 
  inner_join(movies_directors)

PA 4: Military Spending

Today you will be tidying messy data to explore the relationship between countries of the world and military spending.

  • Due Wednesday, 4/26 at 10:00am

Bonus Challenge: Murder in SQL City

For this challenge, you will be using table joins to solve a murder mystery!

  • Due Monday, 5/8 at 11:59pm

To do…

  • PA 4: Military Spending
    • Due Wednesday, 4/26 at 10:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Monday, 5/8 at 11:59pm

Wednesday, January 25th

Today we will…

  • Review Lab 3
  • Review PA 4: Military Spending
  • Housekeeping items…
    • Providing References in Labs + Challenges
    • Clean variable names
    • Lifecycle stages
    • dplyr package updates
    • Saving & Piping Data Joins
  • Extensions to Relational Data
  • Lab 4: Avocado Prices
  • Challenge 3: Avocado Toast Ate My Mortgage

Getting Help and Using ChatGPT


Lab 3: Distinct number of words


count(distinct(hiphop_clean, word))


hiphop_clean |> 
  distinct(word) |> 
  count()


\(f(g(h(x)))\)

in piping syntax is

x |> h() |> g() |> f()

n_distinct(hiphop_clean$word)


hiphop_clean |> 
  pull(word) |> 
  n_distinct()

Clean variable names with library(janitor)

Converts all names of variables in a data set to snake_case.

names(military)
 [1] "Country"        "Notes"          "Reporting year" "1988"          
 [5] "1989"           "1990"           "1991"           "1992"          
 [9] "1993"           "1994"           "1995"           "1996"          
[13] "1997"           "1998"           "1999"           "2000"          
[17] "2001"           "2002"           "2003"           "2004"          
[21] "2005"           "2006"           "2007"           "2008"          
[25] "2009"           "2010"           "2011"           "2012"          
[29] "2013"           "2014"           "2015"           "2016"          
[33] "2017"           "2018"           "2019"          
library(janitor)
military_clean_names <- military |> 
  clean_names()

names(military_clean_names)
 [1] "country"        "notes"          "reporting_year" "x1988"         
 [5] "x1989"          "x1990"          "x1991"          "x1992"         
 [9] "x1993"          "x1994"          "x1995"          "x1996"         
[13] "x1997"          "x1998"          "x1999"          "x2000"         
[17] "x2001"          "x2002"          "x2003"          "x2004"         
[21] "x2005"          "x2006"          "x2007"          "x2008"         
[25] "x2009"          "x2010"          "x2011"          "x2012"         
[29] "x2013"          "x2014"          "x2015"          "x2016"         
[33] "x2017"          "x2018"          "x2019"         

Lifecycle stages

Learn more about lifecycle stages of packages, functions, and function arguments in R.

Image source: deanattali.com/blog/cranalerts/

Deprecated Functions: Military Spending Example

military_clean |> 
  filter(across(Notes:`2019`, is.na)) |> 
  slice_head(n = 3)
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
# A tibble: 3 × 35
  Country      Notes `Reporting year` `1988` `1989` `1990` `1991` `1992` `1993`
  <chr>        <chr> <chr>            <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa       <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
2 North Africa <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Saharan  <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# ℹ 26 more variables: `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>,
#   `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>,
#   `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>,
#   `2008` <chr>, `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>,
#   `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
#   `2018` <chr>, `2019` <chr>
military_clean |>
  filter(if_all(Notes:`2019`, ~ is.na(.x))) |> 
  slice_head(n = 3)
# A tibble: 3 × 35
  Country      Notes `Reporting year` `1988` `1989` `1990` `1991` `1992` `1993`
  <chr>        <chr> <chr>            <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa       <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
2 North Africa <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Saharan  <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# ℹ 26 more variables: `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>,
#   `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>,
#   `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>,
#   `2008` <chr>, `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>,
#   `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>,
#   `2018` <chr>, `2019` <chr>

dplyr updates

The dplyr package was updated to version 1.1.0 on Sunday, 1/29/2023.

See changelog for updates

group_by() vs .by = argument
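With dplyr 1.1.0, a summary can use the new per-operation `.by =` argument instead of a persistent group_by(). A sketch with the cereal data from earlier:

```r
library(dplyr)
library(liver)
data(cereal)

# Persistent grouping with group_by():
cereal |> 
  group_by(shelf) |> 
  summarise(mean_protein = mean(protein))

# Per-operation grouping (dplyr >= 1.1.0); the result is always ungrouped:
cereal |> 
  summarise(mean_protein = mean(protein), .by = shelf)
```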

across()

military |> 
  mutate(across(`1988`:`2019`, na_if, y = ". ."))


military |> 
  mutate(across(`1988`:`2019`, ~ na_if(.x, y = ". .")))

join_by()
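With dplyr 1.1.0, join_by() replaces the by = c("name1" = "name2") syntax; column names are unquoted, and the left-hand side refers to the first (piped-in) table. The earlier inner_join() of directors_genres and directors could be written as:

```r
directors_genres |> 
  inner_join(directors, 
             by = join_by(director_id == id))
```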

Saving & Piping data joins

genres_movies_joined <- inner_join(directors_genres, movies_directors)

genres_movies_joined <- directors_genres |> 
  inner_join(movies_directors)

Extensions to Relational Data

IMDb Movies Data

Multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.

Keys:

  • Uniquely identifies an observation in a data set
  • Relate data sets to each other

Joining Multiple Data Sets

movies_directors
# A tibble: 41 × 2
   director_id movie_id
         <dbl>    <dbl>
 1         429   300229
 2        2931   254943
 3        9247   124110
 4       11652    10920
 5       11652   333856
 6       14927   192017
 7       15092   109093
 8       15092   237431
 9       15093   109093
10       15093   237431
# ℹ 31 more rows
directors
# A tibble: 34 × 3
      id first_name   last_name
   <dbl> <chr>        <chr>    
 1   429 Andrew       Adamson  
 2  2931 Darren       Aronofsky
 3  9247 Zach         Braff    
 4 11652 James (I)    Cameron  
 5 14927 Ron          Clements 
 6 15092 Ethan        Coen     
 7 15093 Joel         Coen     
 8 15901 Francis Ford Coppola  
 9 15906 Sofia        Coppola  
10 16816 Cameron      Crowe    
# ℹ 24 more rows
movies
# A tibble: 36 × 4
       id name             year  rank
    <dbl> <chr>           <dbl> <dbl>
 1  10920 Aliens           1986  8.20
 2  17173 Animal House     1978  7.5 
 3  18979 Apollo 13        1995  7.5 
 4  30959 Batman Begins    2005 NA   
 5  46169 Braveheart       1995  8.30
 6 109093 Fargo            1996  8.20
 7 111813 Few Good Men, A  1992  7.5 
 8 112290 Fight Club       1999  8.5 
 9 116907 Footloose        1984  5.80
10 124110 Garden State     2004  8.30
# ℹ 26 more rows

join_one_data <- movies_directors |> 
  inner_join(directors, 
             by = c("director_id" = "id")
             )
join_one_data
# A tibble: 41 × 4
   director_id movie_id first_name last_name
         <dbl>    <dbl> <chr>      <chr>    
 1         429   300229 Andrew     Adamson  
 2        2931   254943 Darren     Aronofsky
 3        9247   124110 Zach       Braff    
 4       11652    10920 James (I)  Cameron  
 5       11652   333856 James (I)  Cameron  
 6       14927   192017 Ron        Clements 
 7       15092   109093 Ethan      Coen     
 8       15092   237431 Ethan      Coen     
 9       15093   109093 Joel       Coen     
10       15093   237431 Joel       Coen     
# ℹ 31 more rows

If you update dplyr

directors_movie_ids <- movies_directors |> 
  inner_join(directors, 
             by = join_by(director_id == id)
             )
join_two_data <- movies_directors |> 
  inner_join(directors, 
             by = c("director_id" = "id")
             ) |> 
  inner_join(movies,
             by = c("movie_id" = "id")
             ) |> 
  rename(movie_name = name)
join_two_data
# A tibble: 41 × 7
   director_id movie_id first_name last_name movie_name               year  rank
         <dbl>    <dbl> <chr>      <chr>     <chr>                   <dbl> <dbl>
 1         429   300229 Andrew     Adamson   Shrek                    2001  8.10
 2        2931   254943 Darren     Aronofsky Pi                       1998  7.5 
 3        9247   124110 Zach       Braff     Garden State             2004  8.30
 4       11652    10920 James (I)  Cameron   Aliens                   1986  8.20
 5       11652   333856 James (I)  Cameron   Titanic                  1997  6.90
 6       14927   192017 Ron        Clements  Little Mermaid, The      1989  7.30
 7       15092   109093 Ethan      Coen      Fargo                    1996  8.20
 8       15092   237431 Ethan      Coen      O Brother, Where Art T…  2000  7.80
 9       15093   109093 Joel       Coen      Fargo                    1996  8.20
10       15093   237431 Joel       Coen      O Brother, Where Art T…  2000  7.80
# ℹ 31 more rows

Joining on Multiple Variables

Using the hiphop data from Lab 3…

music
# A tibble: 10,752 × 6
   word          subj   folk  rock country   pop
   <chr>         <chr> <dbl> <dbl>   <dbl> <dbl>
 1 [to be] ghost p15       0     1       3     4
 2 [to be] ghost p53       0     0       1     3
 3 [to be] ghost p136      0     0       0     0
 4 [to be] ghost p36       1     1       2     2
 5 [to be] ghost p118      0     1       0     3
 6 [to be] ghost p106      0     0       1     1
 7 [to be] ghost p70       2     0       0     0
 8 [to be] ghost p107      0     1       3     1
 9 [to be] ghost p102      0     0       0     0
10 [to be] ghost p83       0     4       4     4
# ℹ 10,742 more rows
familiarity
# A tibble: 10,752 × 3
   word          participant familiarity
   <chr>         <chr>             <dbl>
 1 [to be] ghost p15                   1
 2 [to be] ghost p53                   1
 3 [to be] ghost p136                  1
 4 [to be] ghost p36                   1
 5 [to be] ghost p118                  1
 6 [to be] ghost p106                  1
 7 [to be] ghost p70                   1
 8 [to be] ghost p107                  1
 9 [to be] ghost p102                  5
10 [to be] ghost p83                   1
# ℹ 10,742 more rows

music_wordfam <- music |> 
  full_join(familiarity,
            by = c("word" = "word",
                   "subj" = "participant")
            )
music_wordfam

# A tibble: 10,752 × 7
   word          subj   folk  rock country   pop familiarity
   <chr>         <chr> <dbl> <dbl>   <dbl> <dbl>       <dbl>
 1 [to be] ghost p15       0     1       3     4           1
 2 [to be] ghost p53       0     0       1     3           1
 3 [to be] ghost p136      0     0       0     0           1
 4 [to be] ghost p36       1     1       2     2           1
 5 [to be] ghost p118      0     1       0     3           1
 6 [to be] ghost p106      0     0       1     1           1
 7 [to be] ghost p70       2     0       0     0           1
 8 [to be] ghost p107      0     1       3     1           1
 9 [to be] ghost p102      0     0       0     0           5
10 [to be] ghost p83       0     4       4     4           1
# ℹ 10,742 more rows

if dplyr updated…

music_wordfam <- music |> 
  full_join(familiarity,
            by = join_by(word == word, 
                         subj == participant)
            )
music_wordfam 

Lab + Challenge

Lab 4: Avocado Prices + Challenge 4: Avocado Toast Ate My Mortgage


Handy Helpers

rename() – Change names of columns

separate() – Separate values of a variable
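A sketch of both helpers on a hypothetical column such as region_type = "California_organic": separate() splits one column into several, and rename() changes a column name.

```r
library(dplyr)
library(tidyr)

# Hypothetical tibble for illustration (not the actual avocado data).
prices <- tibble::tibble(
  region_type = c("California_conventional", "California_organic"),
  avg_price   = c(1.10, 1.85)
)

prices |> 
  separate(region_type, into = c("region", "type"), sep = "_") |> 
  rename(growing_method = type)
```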


Filtering Joins

semi_join(): Keeps values found in another data set

anti_join(): Keeps values not found in another data set

Workflow

  1. Load packages + read in original data
library(tidyverse)
data_original <- read_csv(file = "path/to/data.csv")
  2. Clean data – save your changes! This is now your new “master” data set.
data_clean <- data_original |> 
  mutate(across(x1:x5, ~ as.factor(.x))) |> 
  mutate(new_var = if_else(...))
  3. If you need subsets, create those from your new “master” data for the specific tasks.
demographics_subj <- data_clean |> 
  distinct(subj, .keep_all = TRUE)
  4. Output only the information you want to include in your assignment.

To do…

  • Lab 4: Avocado Prices
    • Due Friday, 2/3 at 11:59pm
  • Challenge 4: Avocado Toast Ate My Mortgage
    • Due Saturday, 2/4 at 11:59pm
  • Read Chapter 5: Special Data Types
    • Concept Check 5.1 + 5.2 + 5.3 due Monday (2/6) at 8:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Sunday 2/12 at 11:59pm